Abstract: To achieve reliability in distributed storage systems,
data has usually been replicated across different nodes. However,
the increasing volume of data to be stored has motivated the 
introduction of erasure codes, a storage efficient alternative to 
replication, particularly suited for archival in data centers, where 
old datasets (rarely accessed) can be erasure encoded, while 
replicas are maintained only for the latest data. Many recent 
works consider the design of new storage-centric erasure codes 
for improved repairability. In contrast, this paper addresses 
the migration from replication to encoding: traditionally erasure 
coding is an atomic operation in that a single node with the whole 
object encodes and uploads all the encoded pieces. Although large 
datasets can be concurrently archived by distributing individual 
object encodings among different nodes, the network and computing
capacity of individual nodes constrain the archival process
due to such atomicity. 

We propose a new pipelined coding strategy that distributes the 
network and computing load of single-object encodings among 
different nodes, which also speeds up multiple object archival. We 
further present RapidRAID codes, an explicit family of pipelined 
erasure codes which provides fast archival without compromising 
either data reliability or storage overheads. Finally, we provide 
a real implementation of RapidRAID codes and benchmark its 
performance using both a cluster of 50 nodes and a set of Amazon 
EC2 instances. Experiments show that RapidRAID codes reduce 
a single object's coding time by up to 90%, while when multiple 
objects are encoded concurrently, the reduction is up to 20%. 

Index Terms — archival, migration, erasure codes, distributed 
storage 

I. Introduction 

Networked distributed storage systems such as the Google File
System (GFS) [1], Amazon S3 [2] or the Hadoop file system
(HDFS) [3] spread data among several storage nodes and can
scale out from hundreds to thousands of commodity storage
servers to accommodate the ever-growing volume of data
to be stored. To ensure that data survives failures of some of
the storage nodes, all data needs to be redundantly stored. The
simplest way to introduce redundancy is to store multiple copies
(or replicas) of each data object across the system. But erasure codes,
a more sophisticated type of redundancy, can provide equivalent
or even better fault-tolerance than replication for significantly
lower storage overhead [4], and hence have increasingly been
embraced in recent times in systems such as Microsoft Azure
[5], Hadoop FS [6], [7] and the new version of the Google File
System [8], among others. Typical choices of erasure codes used
in these systems have an overall overhead of 1.3x-1.5x the
size of the original data, which reduces the typical overhead
of storing three replicas by up to 50%.

Although erasure codes have the potential to significantly reduce
storage costs in distributed storage systems, there are still
two advantages of using replication to store newly introduced 
data: 

• Pipelined Insertion: Replication makes it easy to pipeline
the redundancy generation process: data being stored in a
node can be simultaneously forwarded to a second node,
and from this second node to a third, and so on [1], [5].
Such a pipelining process distributes the redundancy
generation costs among different nodes, achieving
a high storage throughput as well as immediate data
reliability.

• Data Locality: Freshly introduced data in the system is
very likely to be accessed and used, e.g., in a batch process
to carry out some analytics. Replicating the data in several
storage nodes allows the task scheduler to exploit data
locality: jobs are scheduled on the same nodes where the
data is located [10], [11]. Such a scheduling strategy reduces
network latencies and increases data and processing
throughputs.

Due to these properties, distributed storage systems often
store newly introduced data using replication, and rely on erasure
codes to archive older and infrequently accessed data [6],
[12]. Such a pragmatic design allows systems to enjoy the
benefits of replication (fast data insertion, data locality, etc.)
when the data is in frequent use, as well as those of erasure
codes (high fault-tolerance for lower storage cost) when the
data is not accessed regularly, but still needs to be preserved.

The need to access a specific piece of stored data declines
significantly within a short period of time [12], which justifies replacing
the replicas by an erasure code based archival. This migration
usually consists of an atomic operation where a single storage
node obtains the entire data object (by downloading blocks
from different nodes if needed), encodes it, and finally uploads
the various parity blocks to different storage nodes, after which
the number of replicas can be safely reduced to one. Although
the encoding of one data object using this naive approach is
inherently centralized, large datasets (containing several data
objects) can sometimes be concurrently encoded by distributing
individual encoding operations across different nodes. This does
not change the fact that the limited network and computing
capacity of individual nodes remains a bottleneck that slows
down the whole archival process.

While different aspects of erasure coding based distributed
storage systems have been studied recently, including
maximizing the fault-tolerance of erasure codes [13], [14],
reducing the costs of repairing failures [15]-[17], or deduplicating
encoded data [18], this paper instead looks at a relatively
unexplored problem: the efficiency of the migration
from replication to erasure codes, with the aim of optimizing data
archival in distributed storage systems.

Our main contributions are three-fold. 

(1) We propose a novel coding strategy that splits the single- 
object encoding operation into different tasks that can be 
concurrently executed in different nodes, thus distributing the 
network and computing load of the archival process across multiple
nodes, which in turn speeds up the archival process. Our
new encoding scheme is inspired by the pipelined insertions 
used in replication: First, the encoding process is distributed 
among those nodes storing replicated data of the object to be 
encoded, which exploits data locality and saves network traffic. 
We then arrange the encoding nodes in a pipeline where each 
node sends some partially encoded data to the next node, which 
creates parity data simultaneously on different storage nodes, 
avoiding the extra time required to distribute the parity after 
the encoding process is terminated. 

(2) We further present RapidRAID codes, an explicit family 
of erasure codes that realizes the pipelined erasure coding idea 
and provides fast archival without compromising either on data 
reliability or on storage overhead. Interestingly, RapidRAID 
codes only require the existence of two object replicas to 
execute a pipeline encoding, which makes them suitable for 
archiving data in reduced redundancy systems. Additionally, 
RapidRAID codes offer flexible parameter choices to realize 
different storage overheads (up to 2x the size of the original 
data) and different data reliability guarantees. 

(3) We finally provide a real implementation of RapidRAID 
codes that we benchmark both in a small cluster of 50 HP 
ThinClients as well as in a set of Amazon EC2 virtual instances. 
Our experimental results show that RapidRAID coding reduces 
the coding time of single data objects by up to 90%, and by 
up to 20% for batch processing the coding of multiple objects. 
The benefits of RapidRAID codes are also visible when part of 
the network is congested. The presence of congested nodes has 
less detrimental effects on RapidRAID encoding times than on 
traditional encoding times. 

The rest of the paper is organized as follows. In Section II
we provide the basic background on distributed storage systems
and classical erasure codes. In Section III we estimate the
coding times of classical erasure codes and show how pipelined
erasure coding speeds up the coding time by exploiting data
locality. In Sections IV and V we present the family of
RapidRAID codes and we experimentally evaluate its performance
in Section VI. Finally, Sections VII and VIII respectively
present the related work and our conclusions.



II. Background on Erasure Codes 

Distributed storage systems used in data centers have started 
to adopt a hybrid strategy for redundancy, where replicas of 
the newly inserted data are created, while erasure codes are 
preferred for archival of the same data once it does not need to 
be regularly accessed anymore, but still needs to be preserved. 
The number of replicas is then reduced. The use of erasure 
coding for archival increases the fault tolerance of the system 
while reducing storage overheads with respect to replication [5],
though replication remains so far the best form of redundancy 
for new data since it is likely to be frequently manipulated. 

Formally, the encoding process takes k blocks of data and 
computes m parity blocks (or redundancy blocks), which will 
be stored in m other different storage nodes. In most cases, 
since it is unlikely to find data objects that were exactly split 
into k blocks during the insertion process, the k blocks used 
in the encoding process might belong to different data objects. 
For example, in some systems files from the same directory are 
jointly encoded [7].

An optimal erasure code in terms of the trade-off between 
storage overhead and fault tolerance is called a maximum 
distance separable (MDS) code, and has the property that the 
original object can be reconstructed from any k out of the 
n = k + m stored blocks, tolerating the loss of any m = n - k
blocks. The notation "(n, k) code" is often used to emphasize 
the code parameters. Examples of the most widely used MDS 
codes are the Reed-Solomon codes. Such codes will be referred 
to as classical erasure codes, to distinguish them from newly 
designed erasure codes. 

We will denote a data object to be stored by a vector
$o = (o_1, \dots, o_k)$ of $k \times l$ bits, that is, each $o_i$,
$i = 1, \dots, k$, is a string of $l$ bits. Operations are typically
performed using finite field arithmetic, that is, the two bits
$\{0, 1\}$ are seen as forming the finite field $\mathbb{F}_2$ of two
elements, while the $o_i$, $i = 1, \dots, k$, then belong to the binary
extension field $\mathbb{F}_{2^l}$ containing $2^l$ elements. Encoding of
the object $o$ is performed using an $(n \times k)$ generator matrix $G$
such that $G \cdot o^T = c^T$, to obtain an $n$-dimensional codeword
$c = (c_1, \dots, c_n)$ of size $n \times l$ bits. When the generator
matrix $G$ has the form $G = [I_k, G']^T$, where $I_k$ is the identity
matrix and $G'$ is a $k \times (n - k)$ matrix, the codeword $c$ becomes
$c = [o, r]$, where $o$ is the original object and $r$ contains the
$m \times l$ bits of redundancy. The code is then said to be systematic,
in which case the $k$ parts of the original object remain unaltered
after the coding process. The data can then still be read without
requiring a decoding process.

Due to the computational complexity of finite field arithmetic,
erasure codes usually need to operate on small fields
(with small $l$ values) to guarantee a fast coding process. Usually
$\mathbb{F}_{2^8}$ or $\mathbb{F}_{2^{16}}$ ($l = 8$ or $l = 16$) are
preferred due to their efficient
manipulation using 8-bit and 16-bit CPU words. However, the
size of the field also constrains the size of the object (which is
of $lk$ bits) to be either $8k$ or $16k$ bits long, for relatively small
values of $k$. In distributed storage systems where the $k$ blocks
are usually tens of megabytes long, the coding is handled per
part. The coding process iteratively takes $k$ input words ($l$-bit
words), one from each of the $k$ original blocks, to form a small object
of size $lk$, which can be easily encoded.

Fig. 1. Network flow required to encode a data object using a classical
systematic (8,4) erasure code using nodes $n_1, \dots, n_8$. The circled
operator symbol denotes a coding operation.
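The word-by-word systematic encoding described above can be sketched as follows. This is only an illustration under stated assumptions: the field is $\mathbb{F}_{2^8}$ built with the reduction polynomial 0x11D, the parity part $G'$ is a Cauchy matrix, and the helper names (`gf_mul`, `gf_inv`, `cauchy_parity_matrix`, `encode_systematic`) are hypothetical and are not taken from the paper's implementation or from Jerasure.

```python
def gf_mul(a, b, poly=0x11D):
    """Multiply two elements of GF(2^8): carry-less product reduced modulo poly."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return result


def gf_inv(a):
    """Multiplicative inverse in GF(2^8) by brute force (adequate for a sketch)."""
    return next(x for x in range(1, 256) if gf_mul(a, x) == 1)


def cauchy_parity_matrix(k, m):
    """m x k Cauchy matrix: entry (j, i) is 1 / (x_j + y_i), x_j = j, y_i = m + i."""
    return [[gf_inv(j ^ (m + i)) for i in range(k)] for j in range(m)]


def encode_systematic(blocks, m):
    """Return the n = k + m codeword blocks: k originals followed by m parities."""
    k, size = len(blocks), len(blocks[0])
    parity_matrix = cauchy_parity_matrix(k, m)
    parities = [bytearray(size) for _ in range(m)]
    for pos in range(size):          # one small k-word object per byte position
        for j in range(m):
            acc = 0
            for i in range(k):
                acc ^= gf_mul(parity_matrix[j][i], blocks[i][pos])
            parities[j][pos] = acc
    return [bytes(b) for b in blocks] + [bytes(p) for p in parities]


blocks = [b"abcd", b"efgh", b"ijkl", b"mnop"]   # k = 4 toy blocks
codeword = encode_systematic(blocks, m=4)        # (8,4) systematic codeword
```

The first $k$ entries of `codeword` are the untouched original blocks, which is what makes the code systematic; each byte position is encoded independently, mirroring the per-word processing of large blocks.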

III. Pipelining the Redundancy Generation Process 

One of the main drawbacks of classical erasure codes is that 
the encoding process is an atomic operation in that a single 
node has the responsibility to download k blocks (from any 
of the existing replicas), encode them, and finally upload the 
resulting m parity blocks to m other nodes [7]. In this case the
encoding node becomes a network and computing bottleneck 
that slows down the whole coding process. 

To understand better why such an atomicity results in long
encoding times, we depict in Fig. 1 an example of an object encoding
using a classical systematic (8, 4) erasure code: an object
$o = (o_1, o_2, o_3, o_4)$ is encoded into an 8-dimensional codeword
$c = (c_1, \dots, c_8) = (o, c_5, \dots, c_8)$. Node $i$ (denoted by $n_i$
in the figure) stores a replica of the raw data block $c_i = o_i$,
$i = 1, \dots, 4$. To migrate to erasure encoded data, the node
executing the encoding process (marked with the coding-operation
symbol in the figure) downloads
the $k = 4$ original blocks from any of the existing replicas
(here from nodes 1, ..., 4), and computes the redundancy blocks
$c_5, \dots, c_8$, which are then uploaded to nodes 5 to 8. The number
of transmitted blocks is $n = 8$, and it could have been reduced
to $n - 1 = 7$ if the coding process were run, for example, in
node 4, which already stores $c_4$ locally. In this toy example,
exploiting data locality could save a block transmission.

To analytically obtain an estimate of the time required for 
encoding one object using a classical erasure code, we consider 
the best possible scenario and assume that the coding process 
is done in a streamlined manner, meaning that the coding node 
downloads in parallel all the k original blocks and starts to 
generate parity data immediately after receiving the first few 
bytes from each of the k source nodes (e.g., once the first k 
network buffers are filled). Concurrently with the encoding of
this data, the coding node continues to receive data from the
$k$ source nodes, and uploads the partially generated parity data
to the $m - 1$ destination nodes. The time required to encode
an object can then be approximated by:

$$T_{\mathrm{classical}} = \tau_{\mathrm{block}} \cdot \max\{k, m - 1\} + \tau_{\mathrm{classical}}, \qquad (1)$$

where $\tau_{\mathrm{block}}$ is the time needed to download a single data block
under normal network conditions, and $\tau_{\mathrm{classical}}$ represents the
time required to generate parity data from the first $k$ network
buffers. Since the blocks to be encoded are relatively
large, we will assume that the time required to transfer a
block between two nodes is several orders of magnitude longer
than the time required to partially encode an amount of data
equivalent to the size of a network buffer, i.e.,
$\tau_{\mathrm{block}} \gg \tau_{\mathrm{classical}}$.

One way to avoid the bottleneck of having a single coding
node is to pipeline the creation of erasure code redundancy
and distribute the redundancy generation costs among different
storage nodes. The main idea behind our pipelined strategy is
to take advantage of the fact that the data to be encoded is
already spread and replicated over different nodes. Each
of the nodes with one of the replicas can then combine the data
it stores with data from other nodes to generate part of the
final codeword $c$. In Fig. 2 we depict an example of this idea
using the same code parameters (8,4) used in Fig. 1, though
here we do not insist on the code being systematic. Nodes 1
to 4 together store a replica of the object $o$ as before
(that is, node $i$ stores $o_i$, $i = 1, \dots, 4$), but this time nodes 5
to 8 store a second replica of the same object as well. The
coding process proceeds as follows. The first node sends a
multiple of $o_1$ to the second node. Node 2 computes a
linear combination of this multiple of $o_1$ with $o_2$ and forwards
the result to node 3. Node 3 now has its own data
$o_3$, and again computes a linear combination of $o_3$ with what
it received. The process is iteratively repeated from node $i$ to
node $i + 1$, $i = 1, \dots, 7$. Simultaneously with this pipeline process,
each node also generates its own redundancy block $c_i$, based on
what it owns and receives, which does not have to be the same
linear combination as that sent to the next node. The set of
all the locally generated blocks constitutes the final codeword
$c = (c_1, \dots, c_8)$.

Fig. 2. Network flow required to encode a data object using an (8,4) pipelined
erasure code using nodes $n_1, \dots, n_8$. The circled operator symbol
denotes a coding operation.

Assuming that only two replicas of $o$ are used in the
process, the maximum length of the final codeword is
constrained to $n = 2k$, although we will see in the next sections
that any $n < 2k$ is possible. Additionally, note that the coding
process only requires transmitting seven temporary blocks (in
general $n - 1$ blocks are transmitted), which entails the same
network traffic as a classical encoding process. However, the
coding time for the pipelined strategy is significantly reduced.
In this case we can measure the coding time as the time required
to transmit one block, $\tau_{\mathrm{block}}$, plus $n - 1$ times the delay taken
to receive and encode a network buffer, denoted by $\tau_{\mathrm{pipe}}$ (we
assume here the same streamlined coding strategy):

$$T_{\mathrm{pipe}} = \tau_{\mathrm{block}} + (n - 1)\,\tau_{\mathrm{pipe}}. \qquad (2)$$

Similarly, due to the large size of the blocks being encoded
(of the order of tens of megabytes), we can also assume that
the time required to transfer a block between two nodes is
several orders of magnitude longer than the time required to
partially encode an amount of data equivalent to the size of a
network buffer: $\tau_{\mathrm{block}} \gg \tau_{\mathrm{pipe}}$. Since
$\tau_{\mathrm{block}} \gg \tau_{\mathrm{pipe}}$ and
$\tau_{\mathrm{block}} \gg \tau_{\mathrm{classical}}$, comparing (1)
and (2) shows that the factor $\max\{k, m - 1\}$ in (1) makes
$T_{\mathrm{classical}}$ several times larger than $T_{\mathrm{pipe}}$.
In Section VI we will support
this claim with real experiments.
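To make the comparison between (1) and (2) concrete, the following sketch evaluates both estimates for a (16,11) code. The timing values (1 s per block transfer, 1 ms per network buffer) are purely hypothetical placeholders chosen for illustration, not measurements from the paper, and the function names are illustrative.

```python
def classical_coding_time(k, m, tau_block, tau_classical):
    """Estimate (1): the single coding node moves max(k, m - 1) blocks."""
    return tau_block * max(k, m - 1) + tau_classical


def pipelined_coding_time(n, tau_block, tau_pipe):
    """Estimate (2): one block transfer plus n - 1 per-buffer pipeline delays."""
    return tau_block + (n - 1) * tau_pipe


# Hypothetical values: 1 s to transfer a block, 1 ms to handle one buffer.
n, k = 16, 11
t_classical = classical_coding_time(k, n - k, tau_block=1.0, tau_classical=0.001)
t_pipe = pipelined_coding_time(n, tau_block=1.0, tau_pipe=0.001)
speedup = t_classical / t_pipe
```

Under these assumed values the block-transfer term dominates both estimates, so the ratio is governed almost entirely by the factor $\max\{k, m - 1\}$.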

One possible criticism of the pipelined coding strategy is 
that unlike in classical erasure codes, the generated codeword 
does not contain a raw copy of the original data (i.e., it is not a 
systematic code). The immediate consequence is that accessing 
stored data will always require a decoding operation, which 
always comes with an associated CPU overhead. However, the
benefits of a fast and less CPU-demanding encoding process (as
we will see in Section VI) outweigh the relative inefficiency
of data access, since the latter is infrequent. Furthermore,
empirical studies have shown that erasure encoded data can be
accessed with relatively low latencies, even when data needs
to be decoded, and this latency can be further reduced
by adopting pipelined decoding operations (faster than classical
decoding operations), which are not reported here because of
space restrictions.

IV. RapidRAID: Motivating Examples 

In this section we present RapidRAID codes, an explicit 
family of erasure codes that realize the idea of pipelined erasure 
codes presented in the previous section. We first illustrate 
the code construction through two simple examples, and in 
Section V we formalize the definition of RapidRAID codes.

A. Example for n = 2k 

We continue with an (8, 4) erasure code, as used in the
previous section. An object $o = (o_1, o_2, o_3, o_4)$, $o_i \in \mathbb{F}_{2^l}$,
of $k = 4$ blocks is stored over $n = 8$ nodes using a codeword
$c = (c_1, \dots, c_8)$, and two replicas of $o$ are initially scattered as
follows (this is the same original placement as that of Fig. 2):

node 1: $o_1$, node 2: $o_2$, node 3: $o_3$, node 4: $o_4$,
node 5: $o_1$, node 6: $o_2$, node 7: $o_3$, node 8: $o_4$.

Based on this replica placement, we split the RapidRAID 
coding process in two phases: 

Phase 1 (vertical coding): Following the pipeline depicted in
Fig. 2, node 1 forwards some multiple of $o_1$ to node 2, which
computes a linear combination of the received data with $o_2$, and
forwards it again to node 3, and so on. More generally, node $i$
encodes the data it gets from the previous node together with
the data it already has and forwards it to the next node. We
denote the data forwarded from node $i$ to its successor, node
$j$, by $x_{i,j}$, which is defined as follows:

$$\begin{aligned}
x_{1,2} &= o_1\psi_1, \\
x_{2,3} &= x_{1,2} + o_2\psi_2 = o_1\psi_1 + o_2\psi_2, \\
x_{3,4} &= x_{2,3} + o_3\psi_3 = o_1\psi_1 + o_2\psi_2 + o_3\psi_3, \\
x_{4,5} &= x_{3,4} + o_4\psi_4 = o_1\psi_1 + o_2\psi_2 + o_3\psi_3 + o_4\psi_4, \\
x_{5,6} &= x_{4,5} + o_1\psi_5 = o_1(\psi_1 + \psi_5) + o_2\psi_2 + o_3\psi_3 + o_4\psi_4, \\
x_{6,7} &= x_{5,6} + o_2\psi_6 = o_1(\psi_1 + \psi_5) + o_2(\psi_2 + \psi_6) + o_3\psi_3 + o_4\psi_4, \\
x_{7,8} &= x_{6,7} + o_3\psi_7 = o_1(\psi_1 + \psi_5) + o_2(\psi_2 + \psi_6) + o_3(\psi_3 + \psi_7) + o_4\psi_4,
\end{aligned}$$

where $\psi_j \in \mathbb{F}_{2^l}$, $j = 1, \dots, 7$, are predetermined values.

Phase 2 (horizontal coding): Each of the $n$ involved nodes also
generates an element $c_i$ of the final codeword by encoding the
received data together with the locally stored data as follows:

$$\begin{aligned}
c_1 &= o_1\xi_1, \\
c_2 &= x_{1,2} + o_2\xi_2 = o_1\psi_1 + o_2\xi_2, \\
c_3 &= x_{2,3} + o_3\xi_3 = o_1\psi_1 + o_2\psi_2 + o_3\xi_3, \\
c_4 &= x_{3,4} + o_4\xi_4 = o_1\psi_1 + o_2\psi_2 + o_3\psi_3 + o_4\xi_4, \\
c_5 &= x_{4,5} + o_1\xi_5 = o_1(\psi_1 + \xi_5) + o_2\psi_2 + o_3\psi_3 + o_4\psi_4, \\
c_6 &= x_{5,6} + o_2\xi_6 = o_1(\psi_1 + \psi_5) + o_2(\psi_2 + \xi_6) + o_3\psi_3 + o_4\psi_4, \\
c_7 &= x_{6,7} + o_3\xi_7 = o_1(\psi_1 + \psi_5) + o_2(\psi_2 + \psi_6) + o_3(\psi_3 + \xi_7) + o_4\psi_4, \\
c_8 &= x_{7,8} + o_4\xi_8 = o_1(\psi_1 + \psi_5) + o_2(\psi_2 + \psi_6) + o_3(\psi_3 + \psi_7) + o_4(\psi_4 + \xi_8),
\end{aligned}$$

where $\xi_j \in \mathbb{F}_{2^l}$, $j = 1, \dots, 8$, are also predetermined values.

Although we defined the coding process using two logically
different phases, we want to highlight that when the coding
process is implemented as a streamlined process, both phases
can be executed simultaneously: as soon as node $i$ receives the
first few bytes of $x_{i-1,i}$, it can start generating the first bytes
of $c_i$, and concurrently forward $x_{i,i+1}$ to node $i + 1$.
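The two phases of the (8,4) example can be simulated on a single machine as sketched below. The assumptions here are ours: the field is $\mathbb{F}_{2^8}$ with reduction polynomial 0x11D, the coefficient values are arbitrary nonzero placeholders (not the optimized values used by the authors), and the nodes are visited sequentially instead of streaming buffers over a network. All function names are hypothetical.

```python
def gf_mul(a, b, poly=0x11D):
    """Multiply two elements of GF(2^8): carry-less product reduced modulo poly."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return result


def scale(block, coeff):
    """Multiply every byte of a block by a field coefficient."""
    return bytes(gf_mul(byte, coeff) for byte in block)


def add(a, b):
    """Field addition of two equally sized blocks (bytewise XOR)."""
    return bytes(x ^ y for x, y in zip(a, b))


def pipeline_encode(local_blocks, psi, xi):
    """local_blocks[i] is the block held by node i + 1; returns (c_1, ..., c_n)."""
    n = len(local_blocks)
    x = bytes(len(local_blocks[0]))                     # x_{0,1} = 0
    codeword = []
    for i, block in enumerate(local_blocks):
        codeword.append(add(x, scale(block, xi[i])))    # Phase 2: c_i
        if i < n - 1:
            x = add(x, scale(block, psi[i]))            # Phase 1: x_{i,i+1}
    return codeword


o = [b"\x01\x02", b"\x03\x04", b"\x05\x06", b"\x07\x08"]   # o_1, ..., o_4
psi = [2, 3, 4, 5, 6, 7, 8]                                  # psi_1, ..., psi_7
xi = [9, 10, 11, 12, 13, 14, 15, 16]                         # xi_1, ..., xi_8
c = pipeline_encode(o + o, psi, xi)      # nodes 5 to 8 hold the second replica
```

Each node only needs the block it already stores and the partial combination received from its predecessor, which is why a single intermediate block travels over each link of the chain.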

B. Object Reconstruction and Fault Tolerance

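Using the notation of Section II, we can express the
RapidRAID coding process of the (8,4) example using the
standard linear coding notation $G \cdot o^T = c^T$ as

$$\begin{pmatrix}
\xi_1 & 0 & 0 & 0 \\
\psi_1 & \xi_2 & 0 & 0 \\
\psi_1 & \psi_2 & \xi_3 & 0 \\
\psi_1 & \psi_2 & \psi_3 & \xi_4 \\
\psi_1 + \xi_5 & \psi_2 & \psi_3 & \psi_4 \\
\psi_1 + \psi_5 & \psi_2 + \xi_6 & \psi_3 & \psi_4 \\
\psi_1 + \psi_5 & \psi_2 + \psi_6 & \psi_3 + \xi_7 & \psi_4 \\
\psi_1 + \psi_5 & \psi_2 + \psi_6 & \psi_3 + \psi_7 & \psi_4 + \xi_8
\end{pmatrix}
\begin{pmatrix} o_1 \\ o_2 \\ o_3 \\ o_4 \end{pmatrix}
=
\begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_8 \end{pmatrix}.$$

It is easy to see that we can use Gaussian elimination
to reconstruct the original object, $o$, from any subset of four
linearly independent symbols of $c$. Maximizing the number of
linearly independent 4-subsets in $c$ can be done by exhaustive
computational search of the values taken by $\psi_j$ and $\xi_j$ once the
size $2^l$ of the field is fixed. When all $k$-subsets in $c$ are linearly
independent, the code becomes MDS, which achieves the
highest possible fault tolerance given any $k$ and $n$. Note that
the larger the field $\mathbb{F}_{2^l}$, the easier it is to remove the linear
dependencies within $c$ [19].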

However, even by selecting the optimal values of $\psi_j$ and $\xi_j$,
there could be some intrinsic dependencies introduced by the
pipelined coding process itself that cannot be removed. In the
example of the (8,4) code proposed, from all the $\binom{8}{4} = 70$
possible 4-subsets, there is one single linearly dependent 4-subset,
namely $\{c_1, c_2, c_5, c_6\}$, which cannot be removed, no
matter the values taken by $\xi_i$ and $\psi_i$ in $\mathbb{F}_{2^l}$, for any $l$. Recall
that $2 \equiv 0$ in $\mathbb{F}_{2^l}$. Then the following linear combination of
$c_1$, $c_2$, $c_5$ and $c_6$ always evaluates to zero:

$$\begin{aligned}
& c_1\left[(\psi_1\xi_6\xi_2^{-1} + \psi_5 + \xi_5)\,\xi_1^{-1}\right] + c_2\,\xi_6\xi_2^{-1} + c_5 + c_6 \\
&= o_1\xi_1(\psi_1\xi_6\xi_2^{-1} + \psi_5 + \xi_5)\,\xi_1^{-1} + (o_1\psi_1 + o_2\xi_2)\,\xi_6\xi_2^{-1} \\
&\quad + (o_1\psi_1 + o_1\xi_5 + o_2\psi_2 + o_3\psi_3 + o_4\psi_4) \\
&\quad + (o_1\psi_1 + o_1\psi_5 + o_2\psi_2 + o_2\xi_6 + o_3\psi_3 + o_4\psi_4) \\
&= o_1\xi_1\psi_1\xi_6\xi_2^{-1}\xi_1^{-1} + o_1\xi_1\psi_5\xi_1^{-1} + o_1\xi_1\xi_5\xi_1^{-1} + o_1\psi_1\xi_6\xi_2^{-1} \\
&\quad + o_2\xi_2\xi_6\xi_2^{-1} + o_1\psi_1 + o_1\xi_5 + o_2\psi_2 + o_3\psi_3 + o_4\psi_4 \\
&\quad + o_1\psi_1 + o_1\psi_5 + o_2\psi_2 + o_2\xi_6 + o_3\psi_3 + o_4\psi_4 = 0.
\end{aligned}$$

It shows that the code is not an MDS code: if all the redundant
blocks but $\{c_1, c_2, c_5, c_6\}$ fail, it will be impossible to recover
the original data $o$. In Section V we will analyze in detail which
are the $(n, k)$ values that allow MDS codes to be obtained, and for
the rest of the $(n, k)$ values, we will quantify the impact that the
non-MDS property has on the overall data reliability.
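The vanishing linear combination of $c_1$, $c_2$, $c_5$ and $c_6$ can also be checked numerically. The sketch below is an illustration only: it assumes $\mathbb{F}_{2^8}$ with reduction polynomial 0x11D, draws random nonzero coefficients, and uses hypothetical helper names.

```python
import random


def gf_mul(a, b, poly=0x11D):
    """Multiply two elements of GF(2^8): carry-less product reduced modulo poly."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return result


def gf_inv(a):
    """Multiplicative inverse in GF(2^8) by brute force (adequate for a sketch)."""
    return next(x for x in range(1, 256) if gf_mul(a, x) == 1)


def natural_dependency_residual(rng):
    """Evaluate c1*[(psi1*xi6/xi2 + psi5 + xi5)/xi1] + c2*xi6/xi2 + c5 + c6."""
    o1, o2, o3, o4 = (rng.randrange(256) for _ in range(4))
    # Index 0 is unused so that psi[j] and xi[j] match the 1-based notation.
    psi = [None] + [rng.randrange(1, 256) for _ in range(7)]
    xi = [None] + [rng.randrange(1, 256) for _ in range(8)]
    tail = gf_mul(o2, psi[2]) ^ gf_mul(o3, psi[3]) ^ gf_mul(o4, psi[4])
    c1 = gf_mul(o1, xi[1])
    c2 = gf_mul(o1, psi[1]) ^ gf_mul(o2, xi[2])
    c5 = gf_mul(o1, psi[1] ^ xi[5]) ^ tail
    c6 = (gf_mul(o1, psi[1] ^ psi[5]) ^ gf_mul(o2, psi[2] ^ xi[6])
          ^ gf_mul(o3, psi[3]) ^ gf_mul(o4, psi[4]))
    t = gf_mul(xi[6], gf_inv(xi[2]))                       # xi_6 * xi_2^{-1}
    a1 = gf_mul(gf_mul(psi[1], t) ^ psi[5] ^ xi[5], gf_inv(xi[1]))
    return gf_mul(c1, a1) ^ gf_mul(c2, t) ^ c5 ^ c6


rng = random.Random(1)
residuals = [natural_dependency_residual(rng) for _ in range(100)]
```

Every residual is zero regardless of the data and coefficient values drawn, which is the numerical counterpart of the cancellation shown in the derivation.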

C. Example for n < 2k 

The previous (8,4) code example enjoys a symmetric construction
inherited from $n = 2k$, but we can extend the
RapidRAID coding scheme to $n < 2k$. As an example we
consider the case of a (6,4) code, which requires the replicas of $o$
to be initially overlapped on the $n = 6$ nodes as follows:

node 1: $o_1$, node 3: $o_3, o_1$, node 5: $o_3$,
node 2: $o_2$, node 4: $o_4, o_2$, node 6: $o_4$.

The rest of the coding process continues as previously explained.
The basic difference is in the computation made
by nodes 3 and 4, which in this case corresponds to:

$$\begin{aligned}
x_{3,4} &= x_{2,3} + o_3\psi_3 + o_1\psi_4, & c_3 &= x_{2,3} + o_3\xi_3 + o_1\xi_4, \\
x_{4,5} &= x_{3,4} + o_4\psi_5 + o_2\psi_6, & c_4 &= x_{3,4} + o_4\xi_5 + o_2\xi_6.
\end{aligned}$$

Note that some of the subindexes of the coefficients $\psi$ and $\xi$ might
need to be altered accordingly.

V. RapidRAID: General Definition 

Inspired by the examples of the previous section, we now present
a general definition of RapidRAID codes for any pair $(n, k)$ of
parameters, where $n \le 2k$. We start by stating the requirements
that RapidRAID imposes on how data must be stored:

• As shown in the (6,4) example code, when $n < 2k$ two of
the stored replicas should be overlapped between the $n$ storage
nodes: a replica of $o$ should be placed in nodes 1 to $k$,
and a second replica of $o$ in nodes $n - k + 1$ to $n$.

• The final $n$ redundancy blocks forming $c$ have to be
generated (and finally stored) in nodes that were already
storing a replica of the original data.

We then formally define the temporary redundant block that
each node $i$ in the pipelined chain sends to its successor as:

$$x_{i,i+1} = x_{i-1,i} + \sum_{o_j \in \text{node } i} o_j\psi_i, \quad 1 \le i \le n - 1, \qquad (3)$$

with $x_{0,1} = 0$, while the final redundant block $c_i$ generated/stored
in each node $i$ is:

$$c_i = x_{i-1,i} + \sum_{o_j \in \text{node } i} o_j\xi_i, \quad 1 \le i \le n, \qquad (4)$$

where $\psi_i, \xi_i \in \mathbb{F}_{2^l}$ are static predetermined values specifically
chosen to guarantee maximum fault tolerance.

A. Fault Tolerance Analysis 

As we already mentioned, the fault tolerance of the code
depends on the number of linearly independent blocks within
the codeword $c$. Optimally, if the code is MDS, all the $\binom{n}{k}$
$k$-subsets of $c$ are linearly independent. In practice, achieving
the MDS property is not always possible due to different types
of linear dependencies generated during the construction of the
RapidRAID code. We distinguish two different types of these
linear dependencies:

1) Natural dependencies are introduced by the pipelined
coding process itself and cannot be removed, no matter
the values taken by $\xi_i$ and $\psi_i$.

2) Accidental dependencies appear due to a bad choice of
the values of $\xi_i$ and $\psi_i$.

To evaluate the fault tolerance of an $(n, k)$ RapidRAID
code, we need to count the different linear dependencies in its
codewords. We first analytically detect natural dependencies
by enumerating all the possible $k$-subsets, and for each $k$-subset,
we determine by symbolic computation whether it
contains linear dependencies. Once we know that there is no
linear dependency, we pick values of $\psi_i$ and $\xi_i$ so as to
avoid accidental dependencies. This can be done at random
for relatively large fields such as $\mathbb{F}_{2^{16}}$, where almost any
random set of coefficients guarantees the absence of accidental
dependencies [19]. For small fields like $\mathbb{F}_{2^8}$, finding a set of
coefficients without accidental dependencies might require long
exhaustive searches.

TABLE I
Static resiliency of three different redundancy schemes (in
number of 9's) for different probabilities of node failure $p$.

Fig. 3. Evaluation of the linear dependencies in $(n, k)$ RapidRAID codewords.
We consider three $n$ values with all the possible $k$ values such that
$n/2 \le k < n$. Panel (b): number of linearly dependent $k$-subsets.
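The subset enumeration described above can be sketched numerically for the (8,4) example. This is a simplified stand-in for the symbolic computation mentioned in the text: it assumes $\mathbb{F}_{2^8}$ with reduction polynomial 0x11D, draws one random set of nonzero coefficients, and therefore reports natural and accidental dependencies together. The function names and the `holders` placement list are hypothetical.

```python
import itertools
import random


def gf_mul(a, b, poly=0x11D):
    """Multiply two elements of GF(2^8): carry-less product reduced modulo poly."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return result


def gf_inv(a):
    """Multiplicative inverse in GF(2^8) by brute force (adequate for a sketch)."""
    return next(x for x in range(1, 256) if gf_mul(a, x) == 1)


def generator_rows(k, holders, psi, xi):
    """Row i holds the coefficients of o_1..o_k in c_{i+1}, following the pipeline."""
    x = [0] * k                              # coefficients of x_{0,1} = 0
    rows = []
    for i, held in enumerate(holders):
        row = list(x)
        for j in held:
            row[j] ^= xi[i]                  # block stored by node i + 1
        rows.append(row)
        if i < len(holders) - 1:
            for j in held:
                x[j] ^= psi[i]               # data forwarded to the next node
    return rows


def rank(rows):
    """Rank of a matrix over GF(2^8) by Gauss-Jordan elimination."""
    m = [list(r) for r in rows]
    rk = 0
    for col in range(len(m[0])):
        pivot = next((r for r in range(rk, len(m)) if m[r][col]), None)
        if pivot is None:
            continue
        m[rk], m[pivot] = m[pivot], m[rk]
        inv = gf_inv(m[rk][col])
        m[rk] = [gf_mul(v, inv) for v in m[rk]]
        for r in range(len(m)):
            if r != rk and m[r][col]:
                factor = m[r][col]
                m[r] = [a ^ gf_mul(factor, b) for a, b in zip(m[r], m[rk])]
        rk += 1
    return rk


def dependent_subsets(rows, k):
    """All k-subsets of codeword positions (0-based) that cannot rebuild o."""
    return [s for s in itertools.combinations(range(len(rows)), k)
            if rank([rows[i] for i in s]) < k]


rng = random.Random(0)
holders = [[0], [1], [2], [3], [0], [1], [2], [3]]   # (8,4) replica placement
psi = [rng.randrange(1, 256) for _ in range(7)]
xi = [rng.randrange(1, 256) for _ in range(8)]
rows = generator_rows(4, holders, psi, xi)
bad = dependent_subsets(rows, 4)
```

The subset `(0, 1, 4, 5)`, that is $\{c_1, c_2, c_5, c_6\}$, appears in `bad` for every choice of coefficients; any additional entries would be accidental dependencies caused by the particular random draw in such a small field.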

Such an enumeration of all possible $k$-subsets is feasible
only for small values of $n$, due to the fast growth of the
number $\binom{n}{k}$ of $k$-subsets to test. In Fig. 3 we computed the
number of natural linear dependencies of $(n, k)$ RapidRAID
codes with $n \in \{8, 12, 16\}$, and all the possible values of $k$,
$n/2 \le k < n$. In Fig. 3a we show the percentage of linearly
independent $k$-subsets and in Fig. 3b the absolute number
of linearly dependent $k$-subsets. We observe that RapidRAID
codes achieve the MDS property when $k \ge n - 3$.

After analyzing all the RapidRAID codes for $n \le 16$, we
propose the following conjecture:

Conjecture 1: An $(n, k)$ RapidRAID code as defined by (3)
and (4) is maximum distance separable (MDS) if $k \ge n - 3$.

However, we would like to highlight that some of the non-MDS
codes (when $k < n - 3$) still achieve high percentages
of linearly independent $k$-subsets. This is the case, for example,
of a (16,11) RapidRAID code, evaluated later in this paper.

To complete the fault tolerance analysis of $(n, k)$ RapidRAID
codes, we now consider their static resilience, which is the
probability of being able to reconstruct a given stored object
when a fraction $p$ of random storage nodes fail. This static
resilience for different node failure probabilities, using the
"number of 9's" metric¹, is shown in Table I, where we compare
three different codes: (i) a (16,11) RapidRAID code, which
is non-MDS, (ii) a (16,11) classical MDS code, and (iii) the
standard replication scheme with three replicas. We see that
although the static resilience of the RapidRAID code is slightly
lower than that of the classical erasure code, for storage systems with
low node failure probabilities ($p < 0.01$), RapidRAID codes
achieve at least the same resiliency as the de facto standard
3-way replication scheme. According to data center studies
published in [20], [21], the annualized failure rate (AFR) of
modern hard disk drives (HDD) is in the range of 2% to 5%,
depending on the age of the disk. Since the time required to
repair a disk failure (which includes the time to detect the disk
failure plus the time to repair and restore the missing data) is
in the range of minutes or a few hours [20], it is reasonable
to expect less than 1% of simultaneous disk failures, making
the RapidRAID code family an attractive alternative to
classical erasure codes in data centers. Besides, the current trend
in data centers is to use solid state disks (SSD), which have even
lower AFRs than traditional HDDs. Further note that
the actual chance of data loss is much lower than the values
indicated by static resilience analysis if the system is repaired
and thus faults are not allowed to accumulate.
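For the two baseline schemes, static resilience has a simple closed form, sketched below. The assumptions are ours: nodes fail independently with probability $p$, an $(n, k)$ MDS code loses data only when more than $n - k$ of its $n$ blocks are lost, and the "number of 9's" is taken as $-\log_{10}$ of the loss probability. The non-MDS RapidRAID case is not covered, since it also depends on which $k$-subsets are linearly dependent. Function names are illustrative.

```python
import math


def loss_probability_mds(n, k, p):
    """Probability that more than n - k of the n blocks of an MDS code are lost."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(n - k + 1, n + 1))


def loss_probability_replication(replicas, p):
    """Probability that every replica of an object is lost."""
    return p ** replicas


def nines(loss):
    """Number of 9's of the survival probability 1 - loss (as a real number)."""
    return -math.log10(loss)


p = 0.01                                           # hypothetical failure probability
mds_nines = nines(loss_probability_mds(16, 11, p))
replication_nines = nines(loss_probability_replication(3, p))
```

Computing the loss probability directly, instead of one minus the survival probability, avoids the floating-point cancellation that would otherwise occur for small $p$.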

VI. Evaluation 

In this section we evaluate the coding performance of a 
RapidRAID code and compare it with that of classical erasure 
coding. The code that we choose for the evaluation is a (16,11) 
code, with parameters similar to those used in real distributed 
storage systems, which offers a data reliability comparable
to a (16,11) classical erasure code (see Table I).

A. Implementation and Testbed 

In order to fairly compare coding times of RapidRAID
codes with those of classical erasure codes, we developed an
experimental distributed storage system² which consists of a
fast Python server infrastructure providing basic store/retrieve
operations, as well as the finite field arithmetic required to encode
and forward data in pipelined erasure codes. The finite field
arithmetic is implemented using the Jerasure [22] library, which
contains a fast set of functions (optimized C code) designed to
construct efficient erasure codes.

¹ For example, 'three nines' represents a probability of 0.999.
² Available online: https://github.com/llpamies/ClusterDFS

TABLE II
Overall coding time of three (16,11) code implementations.

CPU                                                   CEC     RR8    RR16
Intel Atom (N280), 1.66GHz; 512KB cache             17.81    5.06   27.33
Intel Xeon (E5645), 2.40GHz; 12,288KB cache          5.20    3.50    4.31
Intel Core2 Quad (Q9400), 2.66GHz; 3,072KB cache     4.13    1.47    1.95

On top of our distributed storage system we integrated two 
different erasure codes: 

• A (16,11) classical Reed-Solomon erasure code using 
Cauchy generator matrices, as already implemented in 
the Jerasure library. We adjust the erasure code parameters 
to guarantee maximum performance as suggested 
in [23], which makes the Cauchy Reed-Solomon code 
clearly outperform other open source erasure coding 
libraries [23]. We will refer to this code implementation 
as CEC (Classical Erasure Code). 

• A (16,11) RapidRAID code implemented using the finite 
field arithmetic from Jerasure. This implementation can 
work with either 8-bit or 16-bit arithmetic, with operations 
in $\mathbb{F}_{2^{8}}$ or $\mathbb{F}_{2^{16}}$ respectively. In each case the values of 
all $\psi_i$ and $\xi_j$ coefficients are chosen to maximize the 
obtained fault tolerance. We will refer to the 8-bit and 
16-bit RapidRAID implementations as RR8 and RR16 
respectively. 

In the case of RR8, the use of a small finite field makes 
it very difficult to find coefficient values guaranteeing the 
absence of accidental linear dependencies. As a result, the 8-bit 
(16,11) RapidRAID implementation achieves data reliability 
values slightly lower than the ones depicted in Table I. Despite 
this lower reliability, we include the 8-bit implementation in our 
evaluation to show the effect that the word size has on coding 
times. Note that our RapidRAID implementation also includes 
a fast pipelined decoding mechanism that is not discussed here 
because of space restrictions. 

We evaluate the three coding settings, CEC, RR8 and RR16, 
in both a small cluster of 50 HP t5745 ThinClient computers 
and a set of 16 small instances in the Amazon EC2 cloud 
computing service. We will refer to the ThinClient and Amazon 
EC2 testbeds as TPC and EC2 respectively. Finally, in all the 
experiments we assume that each of the k = 11 original 
blocks has a size of 64MB, which is the default block size in GFS and 
HDFS [1], [3]. This means that the original object to be 
stored has a size of 704MB (11 × 64MB), and the final erasure encoded 
object takes 1024MB (16 × 64MB), which represents a storage 
overhead of approximately 1.45× the size of the original data. 
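
The sizes quoted above follow directly from the code parameters. As a quick check, a minimal sketch (the variable names are ours and are not taken from the implementation):

```python
# Block-size arithmetic for the (16,11) code used in the evaluation.
n, k = 16, 11        # total encoded blocks, original blocks
block_mb = 64        # block size in MB (default in GFS and HDFS)

original_mb = k * block_mb   # object size before encoding
encoded_mb = n * block_mb    # object size after erasure encoding
overhead = n / k             # storage overhead factor

print(original_mb, encoded_mb, round(overhead, 2))
```

For comparison, 3-way replication of the same object would store three full copies, i.e. an overhead factor of 3.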

B. Computing Resource Usage 

Before evaluating coding times we measure the overall 
computing requirements of the three evaluated codes. This 
metric is of special interest in datacenters, where an archiving 
process requiring few overall computing resources is preferred 
because it interferes little with normal datacenter 
operations. 

To measure the overall computing requirements of the CEC 
implementation we execute an encoding process where the 
k = 11 original blocks and the m = 5 parity blocks are all stored 
in the local file system, avoiding all network I/O. In that 
case the encoding time corresponds essentially to the time the 
CPU spends executing the coding operations. Similarly, 
to measure the overall computing requirements of the RR8 and 
RR16 implementations, we run an encoding process where the 
work of all n = 16 nodes is executed on a single node, again 
avoiding all network I/O. 

In Table II we report the average encoding time of the 
three encoding implementations when all the computing is 
executed on a single node and no network communication is 
involved. We show the results for three different CPUs. The 
first case (Intel Atom) corresponds to the execution time on 
the ThinClient computers, the second case (Intel Xeon) is an 
Amazon EC2 small instance, and the last one (Intel Core2) is 
a personal desktop computer. Except in the case of the Atom, 
both RapidRAID implementations require less CPU time to 
encode the same amount of data (i.e., 704MB) than the CEC 
implementation. In the case of the Atom CPU, due to the small 
size of the cache memory, the Jerasure library cannot allocate 
the whole lookup table required to perform $\mathbb{F}_{2^{16}}$ arithmetic, 
which increases RR16 coding times as compared to RR8. 
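
To illustrate why the word size interacts with the cache, the following sketch shows the log/antilog lookup-table technique that is commonly used for finite field multiplication. It is not Jerasure's code: the primitive polynomial 0x11D and all function names are assumptions chosen for illustration. In $\mathbb{F}_{2^{8}}$ each table has on the order of 2^8 entries, whereas the same technique in $\mathbb{F}_{2^{16}}$ needs on the order of 2^16 entries per table, so the tables compete for space on a CPU with a small cache.

```python
# Log/antilog table multiplication in GF(2^8): a sketch, not Jerasure's code.
PRIM_POLY = 0x11D   # x^8 + x^4 + x^3 + x^2 + 1 (assumed primitive polynomial)
FIELD_SIZE = 256    # 2^8 elements


def build_tables():
    """Return (exp, log) tables for the generator element 2."""
    exp = [0] * (2 * (FIELD_SIZE - 1))  # doubled so index sums need no modulo
    log = [0] * FIELD_SIZE
    x = 1
    for i in range(FIELD_SIZE - 1):
        exp[i] = x
        log[x] = i
        x <<= 1
        if x & FIELD_SIZE:      # degree reached 8: reduce by the polynomial
            x ^= PRIM_POLY
    for i in range(FIELD_SIZE - 1, 2 * (FIELD_SIZE - 1)):
        exp[i] = exp[i - (FIELD_SIZE - 1)]
    return exp, log


EXP, LOG = build_tables()


def gf_mul(a, b):
    """Multiply two field elements with two table lookups and one addition."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def gf_mul_slow(a, b):
    """Reference shift-and-add multiplication, used only to check gf_mul."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & FIELD_SIZE:
            a ^= PRIM_POLY
        b >>= 1
    return result
```

The table version trades memory for speed: every multiplication becomes a few memory accesses, which is fast only as long as the tables stay in cache.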

We observe that RapidRAID codes can be computed faster 
than even one of the fastest implementations of classical erasure 
codes, and thus their impact on CPU usage is favorable. 
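
The comparison can be made explicit by normalizing the Table II values against CEC. The snippet below is a small sketch that only re-uses the numbers from Table II; the helper name `relative_times` is ours.

```python
# Coding times copied from Table II, as (CEC, RR8, RR16) per CPU.
table_ii = {
    "Intel Atom (N280)": (17.81, 5.06, 27.33),
    "Intel Xeon (E5645)": (5.20, 3.50, 4.31),
    "Intel Core2 Quad (Q9400)": (4.13, 1.47, 1.95),
}


def relative_times(times):
    """Return RR8 and RR16 coding times as fractions of the CEC time."""
    cec, rr8, rr16 = times
    return rr8 / cec, rr16 / cec


for cpu, times in table_ii.items():
    r8, r16 = relative_times(times)
    print(f"{cpu}: RR8 = {r8:.2f} x CEC, RR16 = {r16:.2f} x CEC")
```

A ratio below 1 means the RapidRAID implementation needs less CPU time than CEC; the only ratio above 1 is RR16 on the Atom.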

C. General Coding Times 

In Fig. 4 we show the encoding times of the three 
different implementations for a single object encoding, as well 
as for multiple object encodings. 

In Fig. 4a a single data object is encoded in a totally idle 
system. We see that the two RapidRAID implementations have 
coding times on the order of 90% shorter than the 
classical erasure code implementation. In this case, by distributing 
the network and computing load of the encoding process 
across 16 different nodes, RapidRAID codes significantly speed 
up the single data object's archival process. 
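
One rough way to see where a speedup of this order can come from is to count how many blocks must cross the busiest network link. The model below is our own back-of-envelope simplification, not the authors' analysis. It ignores CPU time, latency and disk I/O, and assumes 1 Gbps links. It further assumes that the classical coding node already holds the object and uploads the other n − 1 encoded blocks, while in the pipeline every node streams roughly one block's worth of data to its successor and the transfers overlap.

```python
def transfer_time_s(blocks, block_mb=64, link_mbps=1000):
    """Seconds to push `blocks` blocks through one link (1 MB = 8 Mbit)."""
    return blocks * block_mb * 8 / link_mbps


def classical_time_s(n):
    # Assumption: one node holds the object and uploads n - 1 encoded blocks.
    return transfer_time_s(n - 1)


def pipelined_time_s(n):
    # Assumption: each node streams about one block to its successor and
    # all hops overlap, so the busiest link carries a single block.
    return transfer_time_s(1)


n = 16
reduction = 1 - pipelined_time_s(n) / classical_time_s(n)
print(f"classical ~{classical_time_s(n):.2f}s, "
      f"pipelined ~{pipelined_time_s(n):.2f}s, reduction ~{reduction:.0%}")
```

Under these assumptions the reduction is 1 − 1/(n − 1). This is meant only as intuition for the measured behavior and does not replace the experiments.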

However, this speedup is obtained at the expense of involving 
16 nodes in the encoding process. It is then interesting to 
measure the encoding throughput of a classical erasure code 
involving the same number of nodes, i.e., when 16 encoding 
processes are executed in parallel. In Fig. 4b we depict the 
per-object encoding times obtained by executing 16 concurrent 
classical encoding processes and 16 RapidRAID encoding 
processes on a group of 16 nodes. In the EC2 setting, the 
two RapidRAID implementations reduce the 
overall coding time by up to 20%. On the ThinClients, the 
16-bit RapidRAID implementation requires around 50% longer 
coding times than classical erasure codes because of 
the small cache size. 

Fig. 4. Coding times of the three different code implementations. Each candle 
depicts the median value, the 25-75% percentiles and the max-min values. 



D. Coding Times in Congested Networks 

In practice, storage nodes might be executing other tasks 
concurrently with data archival processes, which might cause 
some nodes to experience network congestion that in turn 
might affect the coding times. Although nodes in the EC2 
setting are already virtual computers subject to real network 
congestion, we needed to be able to arbitrarily reduce the 
network capacity of some nodes to evaluate the potential 
effects that severe network congestion can have on RapidRAID 
coding times. To evaluate these effects, we use 
the Linux netem driver to introduce arbitrary congestion in our 
cluster of ThinClients. Specifically, we use netem to reduce the 
network bandwidth of some nodes from 1Gbps to 500Mbps, 
and add to these nodes a 100ms network latency (with a 
deviation of up to ±10ms). 
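
The settings above could be applied with a tc/netem command along the following lines. This is a hedged sketch: the paper does not give the exact invocation, the interface name "eth0" is a placeholder, and netem's `rate` option is only available in reasonably recent kernels and iproute2 releases (a separate shaping qdisc may have been used instead). The helper only builds the command; running it requires root privileges on the node to be congested.

```python
import shlex


def netem_command(interface, rate="500mbit", delay="100ms", jitter="10ms"):
    """Build (but do not run) a tc/netem command for one congested node."""
    return [
        "tc", "qdisc", "add", "dev", interface, "root", "netem",
        "delay", delay, jitter,   # mean latency and its deviation
        "rate", rate,             # bandwidth cap
    ]


cmd = netem_command("eth0")  # "eth0" is a hypothetical interface name
print(shlex.join(cmd))
```

Building the argument list separately from executing it makes it easy to review the exact settings applied to each congested node before touching the network configuration.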

In Fig. 5 we depict the effects that different network congestion 
levels have on the coding times of the CEC and RR8 
implementations. Note that we only use the 8-bit RapidRAID 
implementation because $\mathbb{F}_{2^{16}}$ arithmetic cannot be run 
efficiently on the ThinClient cluster. In Fig. 5a we show the time 
required to encode a single object. In the case of RapidRAID 
codes, coding times grow quasi-linearly as the 
number of congested nodes increases. However, in the case 
of classical erasure codes, a single congested 
node already has a major impact on the coding times. Similarly, in 
Fig. 5b we depict the per-object coding times of 16 concurrently 
encoded objects. Compared with the single object coding time, 
the presence of a single congested node has even more impact 
on the coding times of classical erasure codes. In general these 
results show that classical erasure codes are less resilient 
to congested networks than RapidRAID codes. 

(b) Concurrently encode 16 objects using a (16,11) code. 

Fig. 5. Average time required to encode 16 concurrent objects using a (16,11) 
Cauchy Reed-Solomon code and an 8-bit (16,11) RapidRAID code. Nodes have 
500Mbps connections with a latency of 100ms ± 10ms. Error bars depict the 
standard deviation value. 

VII. Related Work 

Despite widespread use of erasure coding for archiving 
data in distributed storage systems, existing literature does 
not explore the process of migration from replication based 
redundancy to erasure code based redundancy. We thus discuss 
some peripherally related works. 

The most relevant related work is that of Fan et al. [12], 
who propose to distribute the task of erasure coding using the 
Hadoop infrastructure, as MapReduce tasks. Any individual 
object is however encoded at a single node, and hence the 
parallelism achieved in their approach is only at the granularity 
of individual data objects. We note from our experiments 
that distributing the individual encoding tasks provides further 
performance benefits. 

Decentralized erasure coding has also been explored in the 
context of sensor networks [24]. However, in such a setting, 
the (disjoint) data generated by k sensors is jointly stored over 
n > k storage sensors based on erasure coding redundancy. 
This is achieved using network coding techniques, and is 
relatively straightforward to achieve, since random linear 
combinations of the already distributed data need to be stored 
over the additional nodes. Such a technique is inapplicable to 
the problem considered in this paper. 

Li et al. ifTBl also used a similar pipelining based encoding 
strategy over a tree-structured topology to reduce the traffic 
required to repair lost redundancy. Redundancy replenishment 
is a very important and vigorously researched topic [15]-[17]; 
however, as noted previously, it is an unrelated problem. 

VIII. Conclusions 

In this paper we introduced a novel pipelined erasure coding 
strategy to speed up the archival of data in distributed storage 
systems. We also presented RapidRAID, an explicit family 
of erasure codes that realizes the idea of pipelined erasure 
coding without compromising either data reliability or storage 
overheads. In particular, we showed that for equivalent storage 
overhead, RapidRAID codes can achieve a fault tolerance 
similar to that of existing erasure codes, and higher than 
replicated systems. Finally, we presented a real implementa- 
tion of RapidRAID codes, and experiments with real system 
benchmarks demonstrate the efficacy of our proposed solution. 
For coding a single object, our approach achieved up to 
90% reduction in time, while even when multiple objects are 
encoded, our approach is up to 20% faster than distribut- 
ing classical erasure coding tasks for different objects. The 
benefits of RapidRAID codes are visible even when part of 
the network is congested, where RapidRAID codes enable 
shorter coding times and scale better than 
existing erasure codes as network congestion increases. 
The current implementation source code is available at 
https://github.com/llpamies/ClusterDFS. 

The design of pipelined erasure coding based RapidRAID 
codes is an important step towards more efficient mechanisms 
to archive "big data" in distributed storage systems. As part of 
our future research, we aim to explore the performance of 
RapidRAID codes under different choices of the code parameters 
k and n. This is especially challenging for large values of n, where 
numerical evaluation of the fault tolerance becomes intractable. 
We also aim to explore how RapidRAID codes can be generalized 
to exploit the existence of more than two replicas, 
particularly the special case of three replicas, which is the 
de facto redundancy scheme used in most production systems. 